Install packages & load libraries necessary.
## Warning: package 'BiocGenerics' was built under R version 4.0.3
## Warning: package 'IRanges' was built under R version 4.0.3
## Warning: package 'S4Vectors' was built under R version 4.0.3
## Warning: package 'circlize' was built under R version 4.0.3
## Warning: package 'kableExtra' was built under R version 4.0.3
## Warning: package 'stringr' was built under R version 4.0.3
## Warning: package 'RCurl' was built under R version 4.0.3
## Warning: package 'GSA' was built under R version 4.0.3
## Warning: package 'VennDiagram' was built under R version 4.0.4
## Warning: package 'futile.logger' was built under R version 4.0.4
The Expression Dataset I chose was GSE97356: Gene expression associated with PTSD in World Trade Center responders: An RNA sequencing study.
A bit about the dataset
gse <- getGEO("GSE97356",GSEMatrix=FALSE)
Dataset details:
Dataset platform details:
Name of Platform: Illumina HiSeq 2000 (Homo sapiens)
Submission Data Date: Nov 02 2010
Last Date Data was Updated: Mar 27 2019
Organisms Included in Platform: Homo sapiens
Quantity of GEO datasets that use this Platform: 8604
Quantity of GEO samples that use this Platform: 142702
Clean the Data and Map to HUGO Symbols
if(!file.exists('GEOmetadb.sqlite')) getSQLiteFile()
con <- dbConnect(SQLite(),'GEOmetadb.sqlite')
sfiles <- getGEOSuppFiles('GSE97356')
fnames = rownames(sfiles)
reads = read.delim(fnames[1],header=TRUE,sep = ",")
#rename first column to genes, as it's more meaningful than X
colnames(reads)[1] <- c("Gene")
The dimensions of my dataset are: 25830, 325
This means there are 25830 rows and 325 columns (the rows indicate genes and the columns are samples, ignoring the first column which delimits that the rows are genes).
I have three kinds of samples in my 324 subjects, based on the study design. The study design says “324 human samples; 201 never, 82 current, 41 past with PTSD”. I want to get a list of which subjects have never/current/past PTSD diagnosis to compare to my reads dataframe.
for(i in 1:324) {
status <- gse@gsms[[i]]@header["characteristics_ch1"][[1]][1]
case <- gse@gsms[[i]]@header["title"]
if(i==1){
metainfo <- data.frame("status"= status, "case" = case)
}
else{
currentMetainfo <- data.frame("status"= status, "case" = case)
metainfo <- rbind(metainfo, currentMetainfo)
}
}
#for a sanity check ensure correct number of subjects with current/never/past PTSD diagnosis
occurences<-table(unlist(metainfo$status))
print(occurences)
##
## ptsd: Current ptsd: Never ptsd: Past
## 81 201 42
The metainformation found in the gse shows 81 subjects with current PTSD, 201 subjects who never had PTSD, and 42 who had past PTSD diagnosis. This doesn’t match the study design… but I looked at the paper and the published paper shows: “201 with never, 81 current, 42 past PTSD”, so we are actually correctly parsing the data, the study design is incorrectly labelled on GEO.
Filter weakly expressed features from my dataset
I’m going to filter lowly expressed genes using edgeR. Since I am keeping the genes that have at least 1 count per million within smallest group sample size, which is 42 (past PTSD), I’m looking for genes that have at least 1 count per million at least 42 times.
cpms = cpm(reads[,2:325])
#add in the genes associated with rownames
rownames(cpms) <- reads[,1]
qualityReads = rowSums(cpms >1) >= 42
filteredReads = reads[qualityReads,]
The filtered dimensions of the dataset now are: 14184, 325.
During filtering we removed 11646 genes.
Checking for duplicates after filtering
I’m going to see if the dataset has duplicates:
filteredReadsDf <- data.frame(table(filteredReads$Gene))
colnames(filteredReadsDf) <- c("Gene", "GeneOccurence")
duplicates <- filteredReadsDf %>% filter(GeneOccurence > 1)
It looks like I don’t have any duplicated gene names, the number of duplicated gene names is 0 .
Let’s take a look at how the gene names are stored:
head(filteredReadsDf$Gene, 5)
## [1] A1BG A1BG-AS1 A2M A2M-AS1 A2MP1
## 14184 Levels: A1BG A1BG-AS1 A2M A2M-AS1 A2MP1 A3GALT2 AAAS AACS ... ZZZ3
Looking at gene AADACL2-AS1 for example, I see it is AADACL2 Antisense RNA 1. These appear to be gene ids, but I want to convert to HUGO symbols. I’m going to see if I can find the corresponding ensembl ids.
Mapping to Ensembl Ids
grch37 <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice") #from https://support.bioconductor.org/p/62064/
ensembl_grch37 = useDataset("hsapiens_gene_ensembl",mart=grch37)
idToEnsembl <- getBM(attributes = c("wikigene_name", "ensembl_gene_id"),
filters = c("wikigene_name"),
values = filteredReads$Gene,
mart = ensembl_grch37)
I can see that I have 0 gene ids that do not map to ensembl ids, so I’m not concerned. I’m now going to convert the ensembl ids to HUGO symbols.
Mapping to HUGO symbols
#first add the ensembl id labels to filteredReads
colnames(idToEnsembl) <- c("Gene", "EnsemblId")
filteredReads <- merge(idToEnsembl, filteredReads, by="Gene")
#now lets get the corresponding HUGO symbols
ensemblToHUGO <- getBM(attributes = c("ensembl_gene_id","hgnc_symbol"),
filters = c("ensembl_gene_id"),
values = filteredReads$EnsemblId,
mart = ensembl_grch37)
Now I have 413 ensembl ids that do not map to HUGO symbols. Lets see how many ensembl ids are duplicated.
Checking for duplicates in Ensembl Ids
filteredReadsDfEnsembl <- data.frame(table(filteredReads$EnsemblId))
colnames(filteredReadsDfEnsembl) <- c("EnsemblGene", "GeneOccurence")
duplicates <- filteredReadsDfEnsembl %>% filter(GeneOccurence > 1)
length(duplicates$EnsemblGene)
## [1] 549
I have 549 duplicated ensembl ids. I wonder how large the overlap between the gene names that are duplicated and those that do not have HUGO symbols is.
#first add the HUGO symbols to to filteredReads
colnames(ensemblToHUGO) <- c("EnsemblId", "HUGOSymbol")
filteredReads <- merge(ensemblToHUGO, filteredReads, by="EnsemblId")
emptyHUGO <- filteredReads %>% filter(HUGOSymbol == "")
duplicatedEnsembl <- data.frame(duplicates$EnsemblGene)
colnames(duplicatedEnsembl) <- c("EnsemblId")
overlapMissingDuplicated <- merge(emptyHUGO, duplicatedEnsembl, by="EnsemblId")
length(overlapMissingDuplicated$EnsemblId)
## [1] 117
It looks like there are 117 gene names that are duplicated ensembl ids and do not have HUGO symbols. The gene names that are missing HUGO symbols make up 3.0299283 % of the dataset. As some of these, 21.3114754 % exactly, have no corresponding HUGO symbol, but all have unique wikigene names, I’m not sure whether these genes are being mapped correctly and don’t feel comfortable removing them.
Additionally, according to lecture 4 slides 26 “if we base our analysis on ensembl gene ids then they are unique elements” as a comment not to necessarily filter out duplicates, and as I have removed lowly expressed genes I feel this is enough at this stage. Instead I’m going to try and preserve all these genes with duplicated names and just create versions of them so theyre unique. This is because the end goal of this assignment is to produce a dataframe with 324 numeric columns where each row has unique HUGO symbols as rownames. Lets see how many HUGO symbols are duplicated:
filteredReadsDfHUGO <- data.frame(table(filteredReads$HUGOSymbol))
colnames(filteredReadsDfHUGO) <- c("HUGOGene", "GeneOccurence")
HUGOduplicates <- filteredReadsDfHUGO %>% filter(GeneOccurence > 1)
length(HUGOduplicates$HUGOGene)
## [1] 1061
There are 512 more duplicated HUGO symbols than ensembl ids. This is a lot.
duplicatedHUGO <- data.frame(HUGOduplicates$HUGOGene)
colnames(duplicatedHUGO) <- c("HUGOSymbol")
duplicatedHUGOFull <- merge(duplicatedHUGO, filteredReads, by="HUGOSymbol")
duplicatedHUGOEnsembl <- merge(duplicatedHUGOFull, duplicatedEnsembl)
length(unique(duplicatedHUGOEnsembl$EnsemblId))
## [1] 549
This shows at least that all the duplicated ensembl ids also have duplicated HUGO symbols, so I’m really contending with the 512 duplicated HUGO symbol genes. I’m going to make the namings of these symbols unique so I can keep them all as discussed above. First I’m going to remove the ensembl ids that do not map to HUGO symbols, of which 117 are duplicated.
filteredReadsFinal <- filteredReads %>% filter(HUGOSymbol != "")
I lost 490 genes that didn’t have HUGO symbols. Now I’m going to make the namings of the HUGO symbols unique.
filteredReadsFinal <- filteredReadsFinal[, c(2,4:ncol(filteredReadsFinal))]
filteredReadsFinal$HUGOSymbol <- make.names(filteredReadsFinal$HUGOSymbol, unique=TRUE)
filteredReadsFinal <- tibble::column_to_rownames(filteredReadsFinal, var="HUGOSymbol")
Apply Normalization
Visualize the data
countDensity <- apply(log2(cpm(filteredReadsFinal)), 2, density)
xlim <- 0; ylim <- 0
for (i in 1:length(countDensity)) {
xlim <- range(c(xlim, countDensity[[i]]$x));
ylim <- range(c(ylim, countDensity[[i]]$y))
}
cols <- rainbow(length(countDensity))
ltys <- rep(1, length(countDensity))
p1 <- plot(countDensity[[1]], xlim=xlim, ylim=ylim, type="n",
ylab="Smoothing density of log2-CPM", main="", cex.lab = 0.85)
for (i in 1:length(countDensity)) lines(countDensity[[i]], col=cols[i], lty=ltys[i])
legend("topright", colnames(dataToPlot),
col=cols, lty=ltys, cex=0.75,
border ="blue", text.col = "green4",
merge = TRUE, bg = "gray90")
We can see some data is not following the same overall pattern as the other samples. Let’s normalize.
Normalizing
Let’s see if we can understand what happened visually.
The boxplot is difficult to interpret, but you do see an adjustment towards the median especially between cases 181 to 223. The count density graph does also show a change in the dataset distribution after normalization. The patients are more tightly clustered, for one.
MDS Plot
plotMDS(d, labels=(metainfo$title), col = c("green","blue", "red")[factor(metainfo$status)])
Here the colour-coding is defined as follows: Blue are patients that have never had PTSD, red are patients that have been diagnosed in the past with PTSD, and green are patients that currently have PTSD.
Since MDS plot represents the distances between samples, we would hope to see past,never and current patients cluster together, but we don’t quite see this.
We want to to visualize our normalized data using a heatmap to get an idea of gene expression patterns.
EdgeR Analysis I will start with a design matrix.
| (Intercept) | metainfo\(status Never </th> <th style="text-align:right;"> metainfo\)status Past | |
|---|---|---|
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 0 | 0 |
I looked at how the “Gene expression associated with PTSD in World Trade Center responders: An RNA sequencing study” paper performed differential expression analysis to get a better idea of where to go from here. The differential expression analysis they did was done using “DESeq252 software based on negative binomial generalized linear models, adjusting for age, race and the five cell type proportions (CD8T, CD4T, natural killer, Bcell, monocytes) in discovery and replication cohorts, respectively”.
This gave me reason to believe that I should not be using the Limma package, as it works best with data that has an underlying linear distribution.I decided to try EdgeR (Robinson et al., 2010), a package for data that works best when the data follows a negative binomial distribution.
A negative binomial distribution has a specific shape when plotted, so I can easily verify that my data is not following a linear distribution by seeing if it adheres to the shape of the curve of a negative binomial distribution.
|
|
|
|
Using this new underlying distribution assumption, I want to see if I have any significantly differentially expressed genes now:
## [1] 2226
## [1] 32
Excitingly, I have 2226 genes pass the threshhold p value and 32 pass FDR (Benjamini Hochberg) correction.
Now that I have some differently expressed genes, let’s view these bad boys in a volcano plot!
Volcano Plot
Let’s take a look at upregulated genes.
There are not a lot of upregulated genes (clearly… there’s only two red dots), so let’s see what the volcano plot looks like with downregulated genes.
Seems to be a bit more interesting.
Top Hits
Now I want to make another heatmap and see if there is any clustering that is happening by PTSD: past, never, current condition. I will use ComplexHeatmap (Gu, 2016).
324 samples make the column names difficult to read for the heatmap, but they are ordered by “current”, “never” and “past” PTSD diagnosis as seen by Case_10, Case_14, Case_16, Case_19, Case_32. There doesn’t seem to be extremely obviously clustering by diagnosis, but in the bottom right corner of the heatmap there is a section that’s more blue and preceding it is a section that is more pale red, it’s also more pale red at the beginning.
Using EdgeR 2226 genes pass the threshhold p value and 32 genes pass FDR (BH) correction.
I chose the threshhold p value of 0.05 as it is accepted across scientific literature as significant. There is little benefit to being more stringent especially since I am doing multiple-hypothesis testing anyways.
See Multiple Hypothesis Testing. I used the Benjamini-Hochberg (BH) method, as it is less strict than Bonferroni but still accepted as robust. There were 32 genes that passed the FDR (BH) correction.
See Volcano Plot.
See Top Hits.
Let’s start by getting the upregulated and downregulated genes.
Getting Genes of Interest
I’ll be using EdgeR as that is the method that permitted me to find differentially expressed genes.
## [1] 71
## [1] 2155
There are 71 upregulated genes, and 2155 downregulated genes, which is pretty spicy. I want to get the EnsemblIDs to put them into the g:profiler webpage to find matches so the next step is to get the EnsemblIDs and save them into a text file.
Annotation Data
The annotation data sets I selected on g:profiler (Raudvere et al., 2019) GO biological process, Reactome, and WikiPathways, as they were the datasets we had used in a previous assignment and I felt comfortable interpreting them. The version I used was e102_eg49_p15_7a9b4d6.
Genesets Returned
I used the same significance thresholds as when I performed gene expression analysis (0.05) with multiple-hypothesis testing correction, i.e. with the FDR-corrected p values. This was done using the Benjamini-Hochberg method, and again, the threshhold for p values was 0.05, as this is the standard.
Up Regulation Results
GO biological pathways: T= 30, Q= 36 T∩Q= 4
Reactome: T= 191, Q= 26 T∩Q= 7
Wiki Pathways: T= 0, Q= 0 T∩Q=0
Down Regulation Results
GO biological pathways: T= 4027, Q= 1682 T∩Q= 582
Reactome: T= 234, Q= 1058 T∩Q= 63
Wiki Pathways: T= 162, Q= 800 T∩Q= 52
All Gene Results
GO biological pathways: T= 4027, Q= 1718 T∩Q= 591
Reactome: T= 234, Q= 1084 T∩Q= 63
Wiki Pathways: T= 162, Q= 826 T∩Q= 52
Screenshots of gProfiler results
Upregulated genes
(No Wiki Pathways Results)
The genesets returned top terms related to hydrogen peroxide catabolic process and interferon signaling, generally related to immune system and metabolic processes (Gongora et al., 1999). This suggests the upregulated genes are involved in antiviral defense/immunity.
Downregulated genes
The genesets returned top terms related to protein modification enzymes, chromatin modification enzymes and the EGF/EGFR signal pathway which is involved in growth, differentiation, migration, adhesion and cell survival (WikiPathways, 2020). This indicates the downregulated genes are involved in cell growth and differentiation.
All Genes
The top term genesets for all genes were dominated by the downregulated gene results, again related to cell differentiation/survival.
I chose EdgeR as it had previously returned differentiated genes in the dataset, and the undelying assumptions of the data distributions matched this dataset best.
See Annotation Data.
See Genesets Returned.
See Genesets Returned. The results for all genes together were overall dominated by the the down-regulated genes, putting the genes together only returned 9 more genes in the intersection of T and Q. The terms returned were related to the same overarching themes as for the downregulated genes. However, there’s a clear distinction in term results between up-regulated and down-regulated genes (immune response/antiviral defense and cell differentiation/growth respectively).
See Genesets Returned.
The paper discussed they had found enrichment in “glucocorticoid receptor signaling and immunity-related pathways” but mentioned that these pathways were not robust to FDR correction. The paper also mentioned that gene FKBP5, with EnsembleID ENSG00000096060 was upregulated in patients with PTSD, which is found in the downregulated genes result of my analysis (downregulated in non-PTSD subjects as compared to those with PTSD). The glucocorticoid receptor pathway is a component of the immune response and we can see these immune response pathways come up in my analysis as well (upregulated gene pathways, which would be downregulated in PTSD patients).
There is a lot of evidence supporting immune dysregulation in PTSD patients, which supports the results found in my upregulated gene pathway analysis. For example, the paper “Inflammatory markers in post-traumatic stress disorder: a systematic review, meta-analysis, and meta-regression” showed increased inflammation in subjects with PTSD compared to healthy controls (Passos et al., 2015). Another poignant example of this finding was the paper “Posttraumatic stress disorder and physical illness: results from clinical and epidemiologic studies” which compiled evidence to show that patients with PTSD may be at risk for autoimmune diseases (Boscarino, 2004).
I will conduct non-thresholded gene set enrichment analysis using the ranked set of genes from the previous analysis above.
For GSEA I need a .rnk and a .gmt file. I will get them now.
Going to use the genesets from the baderlab geneset collection from March 1, 2021 containing GO biological process, no IEA and pathways (the .gmt file).
gmtUrl = "http://download.baderlab.org/EM_Genesets/March_01_2021/Human/symbol/"
filenames = getURL(gmtUrl)
tc = textConnection(filenames)
contents = readLines(tc)
close(tc)
# get the gmt that has all the pathways and does not include terms inferred from
# electronic annotations(IEA) start with gmt file that has pathways only
rx = gregexpr("(?<=<a href=\")(.*.GOBP_AllPathways_no_GO_iea.*.)(.gmt)(?=\">)", contents,
perl = TRUE)
gmtFile = unlist(regmatches(contents, rx))
dest = paste0(getwd(), "/data/genesets.gmt")
download.file(paste(gmtUrl, gmtFile, sep = ""), destfile = dest) # save the file
Upregulated Results:
Name: HYDROGEN PEROXIDE CATABOLIC PROCESS%GOBP%GO:0042744
P value: 0.00
ES: 0.83
NES: 3.13
FDR: 0.00
Number of Genes in Leading Edge: 17
Top Gene Associated with this Gene Set: HBZ
Downregulated Results:
Name: REGULATION OF BETA-CELL DEVELOPMENT%REACTOME%R-HSA-186712.2
P value: 0.00
ES: -0.85
NES: -1.77
FDR: 0.00
Number of Genes in Leading Edge: 15
Top Gene Associated with this Gene Set: AKT1
What method did you use?
I used GSEA preranked analysis. Based on previous work done using GSEA preranked analysis, I set the parameters of the analysis to return a geneset size of 15 and left the geneset permutation number at the default parameters as well. I dropped the default 500 maximum geneset size to 200. By reducing the max gene set size I was hoping to get more specific and more meaningful results.
What genesets did you use?
I used genesets from the baderlab geneset collection from March 1, 2021 containing GO biological process, no IEA and pathways.
See: Summary of GSEA Results.
This is a straight forward comparison, and both methods agreed with each other well. The threshholded analysis performed earlier returned very similar information. In the threshholded analysis, for upregulated genes, the genesets returns top terms related to hydrogen peroxide catabolic process and interferon signaling. In our non-threshholded analysis for upregulated genes we returned the exact same GO biological pathway, again showing a pattern in dysregulation of immune response/antiviral defense in PTSD patients. The top terms returned by the downregulated genes in the threshholded analysis were protein modification enzymes, chromatin modification enzymes and the EGF/EGFR signal pathway, which are involved in cell differentiation/growth. The non-threshholded analysis returned information on regulation of beta-cell development which again relates to cell differentiation/growth. This shows a strong correlation between the results of the threshholded and non-threshholded analysis results.
Using your results from your non-thresholded gene set enrichment analysis visualize your results in Cytoscape.
There were 334 nodes and 868 edges in the resulting map when I set the FDR to 0.1 (default settings). This was quite large, as seen below for both the enrichment map results and annotation results, so I decided to use the more stringent FDR cutoff of 0.05.
Enrichment map 0.1 FDR : Enrichment map 0.05 FDR:
The FDR of 0.05 created a cleaner network with both upregulated and downregulated genes available. I had 78 nodes and 67 edges.
I used the default parameters but specified to space the network to avoid overlap, and also indicated “Create singleton clusters” to avoid losing clusters with only a single gene inside. I ended up deciding for my final publication figure to not space the network to avoid overlap, as the annotated network ended up being easier to interpret this way.
Annotated Enrichment map 0.1 FDR :
Annotated enrichment map 0.05 FDR :
Figure 1:
The network described in Figure1 has many themes apparent. The largest are the receptor signaling kit and the pre notch transcription, and the largest upregulated gene theme is immuno antimicrobial peptide.
The receptor signaling kit refers to the WikiPathway Kit Receptor Signaling pathway. These signals are involved i physiological processes including erythropoiesis, lymphopoiesis, mast cell development and function, megakaryopoiesis, gametogenesis and melanogenesis (Kandasamy et al., 2010). As this is indicated as downregulated by the colourcoding of the Cytoscape Enrichment Map legend, we can assume the Kit Receptor Signaling pathway is downregulated. This strongly adheres to the theme of downregulated genes being involved in cell differentiation/growth pathways as seen in the non-thresholded and threshholded analysis.
Another major theme is the pre notch transcription, which comes from the Reactome Pre-NOTCH Transcription and Translation pathway, again following the theme of downregulated genes being involved in cell differentiation/growth pathways (Sean et al., 2002).
The largest upregulated gene theme, immuno antimicrobial peptide, aligns with the upregulation of immune pathway related genes and, in the case of this analysis, supports the literature related to immune dysregulation in PTSD patients.
The most important aspect of the analysis is relating your results back to the initial data and question.
The enrichment results continue to support conclusions/mechanisms discussed in the original paper. There is a clear theme that immune-related genes are differentially expressed in subjects with PTSD vs. not, which was discussed in the original paper. However, while the paper discussed the glucocorticoid receptor pathway, this did not come up as a major theme in my analysis. That being said, according to the paper, genes involved in glucocorticoid receptor function are also inolved in immune response, particularly the main gene the paper discovered FKBP5, which relates back to the main themes from my enrichment map.
See Summary of GSEA Results Q5, but overall the results were consistent across the threshholded and non-threshholded analysis.
See: Interpretation Q2. Papers include: Passos et al., 2015 and Boscarino, 2004.
Further papers supporting the downregulation of immune related genes includes the 2016 paper titled the “Co-morbidity of PTSD and immune system dysfunction: opportunities for treatment” by Neigh, G. N. & Ali, F. F. (2016). The paper discusses the prevalence of PTSD occurrence along with somatic autoimmune and inflammatory diseases.
A dark matter analysis can tell us about the genes that are differentially expressed in our analysis that don’t show up in pathway results. This can sometimes reveal important patterns that would otherwise remain uncaptured.
We’re going to get the pathways we used for our unthreshholded GSEA analysis to start, along with the genes found in our dataset and the ranked genes we ran through GSEA.
gmtFile <- file.path(getwd(), "/data/genesets.gmt")
capture.output(genesets<- GSA.read.gmt(gmtFile),file="gsa_load.out")
names(genesets$genesets) <- genesets$geneset.names
expression <- rownames(dataMatrix)
ranks <- rankInfo
Now we’re getting the results files from the GSEA results folder which holds the upregulated and downregulated genes annotated to pathway.
#get all the GSEA directories
gseaDirectories <- list.files(path = file.path(getwd(),"GSEAresults"),
pattern = "\\.GseaPreranked")
if(length(gseaDirectories) == 1){
gseaDir <- file.path(getwd(),"GSEAresults")
#get the gsea result files
gseaResultsFiles <- list.files(path = gseaDir,
pattern = "gsea_report_*.*.tsv")
#there should be 2 gsea results files
enrFile1 <- read.table(file.path(gseaDir,gseaResultsFiles[1]),
header = TRUE, sep = "\t", quote="\"",
stringsAsFactors = FALSE,row.names=1)
enrFile2 <- read.table(file.path(gseaDir,gseaResultsFiles[1]),
header = TRUE, sep = "\t", quote="\"",
stringsAsFactors = FALSE,row.names=1)
}
From the enrFile1 and enFile2 we need the following information:
FDRThreshold <- 0.01
#get the genes from the set of enriched pathways (no matter what threshold)
allSigEnrGenesets<- c(rownames(enrFile1)[which(enrFile1[,"FDR.q.val"]<=FDRThreshold)], rownames(enrFile2)[which(enrFile2[,"FDR.q.val"]<=FDRThreshold)])
genesSigEnrGs <- c()
for(i in 1:length(allSigEnrGenesets)){
current_geneset <- unlist(genesets$genesets[which(genesets$geneset.names %in% allSigEnrGenesets[i])])
genesSigEnrGs <- union(genesSigEnrGs, current_geneset)
}
I want to see what the top ranked dark matter genes are.
genesAllGs <- unique(unlist(genesets$genesets))
genesNoAnnotation <- setdiff(expression, genesAllGs)
rankedGeneNoAnnotation <- ranks[which(ranks[,1] %in% genesNoAnnotation),]
rankedGeneNoAnnotation[1:10,]
The top ranked dark matter gene is TMC5 (Transmembrane Channel Like 5). This is a protein coding gene associated with diseases including Brachydactyly, Type E1, which results in a shortening of fingers (Gene Cards, 2021). This doesn’t seem too relevant to the other results found in the analysis, and likely doesn’t indicate any kind of comorbity with PTSD. The second gene is a microRNA, which could possibly indicate a role in cell development (Gene Cards, 2021). The third gene, SNORA54 is a small nucleolar RNA, and so is SNORD54. HBBP1 is a Hemoglobin Subunit Beta Pseudogene, meaning it is a nonfunctional gene. It makes sense why these aren’t annotated in pathways(Gene Cards, 2021).
The CCL genes are all Motif Chemokine Ligands, (Gene Cards, 2021), meaning they are cytokine genes which are involved in inflammatory and immunoregulatory processes. This relates back to the enrichment findings of genes in characterized pathways.
I want a heatmap of the significant genes that are not annotated to any pathways in entire set of pathways used for the analysis.
genesNoAnnotation <- setdiff(expression, genesAllGs)
rankedGeneNoAnnotation <- ranks[which(ranks[,1] %in% genesNoAnnotation),]
genesTable <- qlfOutputHits$table %>% filter(FDR < 0.10)
topHits <- genesTable[which(rownames(genesTable) %in%rankedGeneNoAnnotation$HUGOSymbol),]
topHits <- rownames(topHits)
heatmapMatrixTophits <- t(
scale(t(heatmapMatrix[which(rownames(heatmapMatrix) %in% topHits),])))
orderedMeta <- metainfo %>% arrange(status)
orderedHeatmapMatrixTophits <- heatmapMatrixTophits[, orderedMeta$title]
if(min(orderedHeatmapMatrixTophits) == 0){
heatmapCol = colorRamp2(c( 0, max(orderedHeatmapMatrixTophits)),
c( "white", "red"))
} else {
heatmapCol = colorRamp2(c(min(orderedHeatmapMatrixTophits), 0, max(orderedHeatmapMatrixTophits)), c("blue", "white", "red"))
}
#code for heat map annotation courtesy of prof isserlin:
haColours <- c("darkgreen", "orange", "purple")
names(haColours) <- c("Current", "Never", "Past")
subjectTypeCounts <- table(orderedMeta$status)
ha <- HeatmapAnnotation(df=data.frame(
type = c(rep("Current",(subjectTypeCounts[][1])),rep("Never",(subjectTypeCounts[][2])), rep("Past",(subjectTypeCounts[][3])))),
col = list(type = haColours))
currHeatmap <- Heatmap(as.matrix(orderedHeatmapMatrixTophits),
column_title = "FDR < 0.10 significant genes \n that are not annotated to any pathways",
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = FALSE,
col=heatmapCol,
show_column_names = FALSE,
show_row_names = TRUE,
show_heatmap_legend = TRUE,
top_annotation = ha)
currHeatmap
There doesn’t appear to be a clear pattern of expression of genes not annotated to any pathways in entire set of pathways used for the analysis among people with PTSD and those without. Let’s take a look at the genes not annotated to any of the pathways returned in the enrichment analysis.
genesNoAnnotation <- setdiff(expression, genesSigEnrGs)
rankedGeneNoAnnotation <- ranks[which(ranks[,1] %in% genesNoAnnotation),]
genesTable <- qlfOutputHits$table %>% filter(FDR < 0.05)
topHits <- genesTable[which(rownames(genesTable) %in%rankedGeneNoAnnotation$HUGOSymbol),]
topHits <- rownames(topHits)
heatmapMatrixTophits <- t(
scale(t(heatmapMatrix[which(rownames(heatmapMatrix) %in% topHits),])))
orderedMeta <- metainfo %>% arrange(status)
orderedHeatmapMatrixTophits <- heatmapMatrixTophits[, orderedMeta$title]
if(min(orderedHeatmapMatrixTophits) == 0){
heatmapCol = colorRamp2(c( 0, max(orderedHeatmapMatrixTophits)),
c( "white", "red"))
} else {
heatmapCol = colorRamp2(c(min(orderedHeatmapMatrixTophits), 0, max(orderedHeatmapMatrixTophits)), c("blue", "white", "red"))
}
#code for heat map annotation courtesy of prof isserlin:
haColours <- c("darkgreen", "orange", "purple")
names(haColours) <- c("Current", "Never", "Past")
subjectTypeCounts <- table(orderedMeta$status)
ha <- HeatmapAnnotation(df=data.frame(
type = c(rep("Current",(subjectTypeCounts[][1])),rep("Never",(subjectTypeCounts[][2])), rep("Past",(subjectTypeCounts[][3])))),
col = list(type = haColours))
currHeatmap2 <- Heatmap(as.matrix(orderedHeatmapMatrixTophits),
column_title = "FDR < 0.05 significant genes \n that are not annotated to any of the pathways returned in the enrichment analysis",
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_dend = TRUE,
show_column_dend = FALSE,
col=heatmapCol,
show_column_names = FALSE,
show_row_names = TRUE,
show_heatmap_legend = TRUE,
top_annotation = ha
)
currHeatmap2
The first heatmap is restricted to genes that are not annotated to any pathways that are significant at P < 0.05, while the second is restricted to genes that are not annotated to any pathways in the enrichment analysis with FDR <0.05. This means that the genes not annotated to any pathways are mostly not very significant, and there aren’t too many significant genes not annotated to pathways in the enrichment analysis.
Lets make a dark matter venn diagram to visualize this information.
A <- genesAllGs
B <- genesSigEnrGs
C <- expression
png(file.path(getwd(),"data/darkMatterOverlaps.png"))
venn.plot<- draw.triple.venn(
area1=length(A),
area2=length(B),
area3 = length(C),
n12 = length(intersect(A,B)),
n13=length(intersect(A, C)),
n23 = length(intersect(B,C)),
n123 = length(intersect(A,intersect(B,C))),
category = c("all genesets", "all enrichment results", "expression"),
fill = c("red","green","blue"),
cat.col = c("red","green","blue")
)
grid.draw(venn.plot)
dev.off()
## png
## 2
The all genesets circle shows the list of all genes in our geneset file, the enrichment results shows all the genes in the significant enrichment results, and the blue expression circle shows all the genes in our expression set. The dark matter genes are the set difference (in one set but not the other) of the expression and all genesets.
Bioconductor Support. (2016). ERROR: No residual degrees of freedom in linear model fits. Retrieved March 15, 2021, from https://support.bioconductor.org/p/59168/
Boscarino JA (2004). Posttraumatic stress disorder and physical illness: results from clinical and epidemiologic studies. Ann N Y Acad Sci.1032:141-53. doi: 10.1196/annals.1314.011. PMID: 15677401.
Gene Cards (2021, March 05). Retrieved from https://www.genecards.org/cgi-bin/carddisp.pl?gene=TMC5
Gongora C, Mechti N. L’interféron: un mécanisme complexe de signalisation [Interferon signaling pathways] (1999). Bull Cancer. 86(11):911-9. French. PMID: 10586107.
Gu, Z. (2016) Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics.
Kandasamy, K., Mohan, S. S., Raju, R., Keerthikumar, S., Kumar, G. S. S., Venugopal, A. K., Telikicherla, D., Navarro, J. D., Mathivanan, S., Pecquet, C., Gollapudi, S. K., Tattikota, S. G., Mohan, S., Padhukasahasram, H., Subbannayya, Y., Goel, R., Jacob, H. K. C., Zhong, J., Sekhar, R., Nanjappa, V., Balakrishnan, L., Subbaiah, R., Ramachandra, Y. L., Rahiman, B. A., Prasad, T. S. K., Lin, J., Houtman, J. C. D., Desiderio, S., Renauld, J., Constantinescu, S. N., Ohara, O., Hirano, T., Kubo, M., Singh, S., Khatri, P., Draghici, S., Bader, G. D., Sander, C., Leonard, W. J. and Pandey, A. (2010). NetPath: A public resource of curated signal transduction pathways. Genome Biology. 11:R3.
Kuan, P., Waszczuk, M. A., Kotov, R., Clouston, S., Yang, X., Singh, P. K., . . . Luft, B. J. (2017). Gene expression associated with PTSD in World Trade Center responders: An RNA sequencing study. Translational Psychiatry, 7(12). doi:10.1038/s41398-017-0050-1
Neigh, G. N., & Ali, F. F. (2016). Co-morbidity of PTSD and immune system dysfunction: opportunities for treatment. Current opinion in pharmacology, 29, 104–110. https://doi.org/10.1016/j.coph.2016.07.011
Passos IC, Vasconcelos-Moreno MP, Costa LG, Kunz M, Brietzke E, Quevedo J, Salum G, Magalhães PV, Kapczinski F, Kauer-Sant’Anna M (2015). Inflammatory markers in post-traumatic stress disorder: a systematic review, meta-analysis, and meta-regression. Lancet Psychiatry, 2(11):1002-12. doi: 10.1016/S2
Ritchie, M.E., Phipson, B., Wu, D., Hu, Y., Law, C.W., Shi, W., and Smyth, G.K. (2015). limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Research 43(7), e47.
Robinson MD, McCarthy DJ and Smyth GK (2010). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26, 139-140
Sean, Egan, Children, T. H., University of Toronto, D. O., Person, Marija, . . . Research, O. I. (2002, November 02). Reactome: Pre-NOTCH Transcription and Translation. Retrieved from https://reactome.org/content/detail/R-HSA-1912408
Uku Raudvere, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, Jaak Vilo: g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update) Nucleic Acids Research 2019; doi:10.1093/nar/gkz369 [PDF].
Wiki Pathways. (2020, June 30). EGF/EGFR Signaling Pathway (Homo sapiens). Retrieved from https://www.wikipathways.org/index.php/Pathway:WP437